C/3 



On Bayesian Curve Fitting Via Auxiliary 



o ■ Variables 
o 

(N 

O ! Y. Fan 



School of Mathematics and Statistics 
University of New South Wales, Sydney 2052, Australia 

J.-L. Dortet-Bernadet 

Institut de Recherche Mathematique Avancee, UMR 7501 CNRS 



^ I Universite de Strasbourg, Strasbourg, France 

Qs'. S. A. Sisson 

oo 

School of Mathematics and Statistics 
' University of New South Wales, Sydney 2052, Australia 

o ■ 



November 11, 2009 



Abstract 



In this article we revisit the auxiliary variable method introduced in Smith and Kohn (1996) 
for the fitting of P-th order spline regression models with an unknown num- 
ber of knot points. We introduce modifications which allow the location of 
knot points to be random, and we further consider an extension of the method 
to handle models with non-Gaussian errors. We provide a new algorithm for 
the MCMC sampling of such models. Simulated data examples are used to 
compare the performance of our method with existing ones. Finally, we make 
a cormection with some change-point problems, and show how they can be 
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re-parameterised to the variable selection setting. 

Supplemental materials including R computing codes used in the examples 
are available online. 

Keywords: Change-point; Curve fitting; Gibbs sampling; Markov chain Monte 
Carlo; Splines; Variable selection. 



1 Introduction 

This article examines methods for Bayesian curve fitting. Specifically, given 
observation pairs (xi, yi), . . . , ?/„), we are interested in fitting the regres- 
sion model 

Yi\xi, ...,Xn ^ f{xi) + ei, i = l,...,n (1) 

where ei, e„ are independent draws from a Gaussian distribution A^(0, cj^) 
with cr > unknown. The curve /, about which we wish to make inference, 
is a smooth real-valued function defined on some interval [a, b]. Later in this 
article, we consider the case where the Gaussian error assumption is relaxed. 

A general and powerful non-parametric approach to the fitting of the curve 
/, is via spline functions of a given degree, P > 1. In this setting, / can be 
written as the linear combination 

P K 

f{x) = ao + ^ajx^ +'^r]k{x --/k)+, xe[a,b] (2) 
j=i k=i 

where z+ = max{0, z} and jk, k = 1, . . . , K represent the locations of K knot 
points (see lHastie and Tibshirani 19901) . Typically, the degree P is set to equal 
3, as cubic splines are known to approximate locally smooth function arbitrar- 
ily well. Under the representation fitting the curve consists of estimating 
the number of knots K, the knot locations 7fc, ^ = I, ■ . . ,K, and the corre- 
sponding regression coefficients aj, j = 0, . . . , P and 7]^, k = 1,. . . ,K. See 



Ruppert et al. (2003) for an accessible exposition on non-parametric regression 



models using splines. 



Several authors provide methods for Bayesian inference on this model. 
One such method is to introduce a large number of potential knots, each with a 
fixed location, from which a significant subset can be selected (e.g. [Friedman and Silverman 1989|l . 
If 7fc, A; = 1, Kmax represent Kmax known potential knots, model ^ can be 
written as the linear model 

Y = X^p + e (3) 

where y = (yi, . . . , y^)', /? = (ao, ai, . . . ,ap,r]i, . . . ,r]KmaJ' ' e = (ei, . . . , e^)', 
with design matrix 

= (l,x, . . . ,x^, (x - 171)^, . . . , (x - l7if^aJ+) 



where x = {xi,. . . , and 1 = (1, . . . , 1)'. [Smith and Kohn (1996) recog- 



nised that a Bayesian variable selection technique (e.g. George and McCuUoch 1993[ ) 



can be used to carry out inference on the curve. The variable selection they 
proposed only requires a Gibbs sampler (|Gelfand and Smith 1990|l , thus the 
curve fitting procedure is relatively straightforward and easy to compute. For 



the selection of the best potential knots, |Denison et al. (1998)] proposed to use 
a reversible jump Markov chain Monte Carlo (MCMC) algorithm (|Green 19951 
ISisson 2005|) : their method avoids computation of the spline coefficients rjk by 



substituting their least squares estimates. [Biller (2000) [ provides an alternative 
reversible jump MCMC algorithm extending to the case for non-Gaussian er- 
rors. 

All these methods are very efficient in practice for many types of appli- 
cations. Nevertheless, in some cases, the need to define the discrete set of 
candidate knots can become a limitation. A common procedure is to use 
some of the sorted distinct values of the potential knots; for exam- 



ple. Smith and Kohn (1996) recommend placing a potential knot each three 
to five sorted Xi values when using cubic splines. Clearly, this can be prob- 
lematic when the .x/s are non-regular ly spaced. A solution consists of the 



placement of knots from a continuous proposal, as in DiMatteo et al. (2001) 



who extended the approach taken by [Denison et al. (1998) to a fully Bayesian 



treatment. They used conjugate priors for the regression parameters and in- 
tegrated them out of the posterior, and proposed a reversible jump MCMC 



sampler that runs only on the number and locations of the knots. 

In this article we consider a generalization of the auxiliary variable method. 



first introduced in similar context by Smith and Kohn (1996) that allows the 
location of the potential knots to be unknown. This generalization is based 
on the introduction into the model of intervals in which the potential knots 
may lie. The proposed method is expected to offer a better fit of the curve to 
the data, since we consider knots from a continuous space, while retaining the 
simplicity of a Metropolis-within-Gibbs sampler for inference on the model. 
More precisely, we give in Section|2]the general set up for our modelling strat- 
egy and discuss how inference is carried out. Section |3] extends the auxiliary 
variable modelling approach to the more general setting where we have non- 
Gaussian errors, and suggests a new algorithm for MCMC sampling. In Sec- 
tion|4]we revisit some change-point detection problems and see that the use of 
an auxiliary variable setting is beneficial from a computational point of view. 
Finally, we conclude with some discussion in Section |5l 



2 Curve fitting via an auxiliary variable approach 
2.1 The model and prior assumptions 

We adopt an auxiliary variable approach by introducing a vector of binary 
indicator variables Zk,k = 1, . . . , Kmax, 

{1 if there is a knot point 7^ in the interval Ik and i]k / 
if there is no knot point in the interval 1^ and rjf^ = 

where r/^ denotes the spline coefficients in model Q, and the intervals Ik are 
defined on the range of the x/s. Each interval 1^ contains at most one knot 
with unknown location ^y^. In practice, such intervals can be defined either 
using prior information on regions where a knot is suspected or, in the absence 
of such prior information, an equal partition of the range may be adopted. We 
denote the vector (71, ... , jKmaxY by 7 and consider the product of uniform 
distributions on the interval as the prior distribution on 7. 
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Each possible value for 7 gives a model of the form Let X^^-y denote the 
matrix constructed with the columns of corresponding to non-zero entries 
in z, and let /?^^^ denote the vector of corresponding regression coefficients. 
We use the following decomposition of the joint prior distribution of all the 
unknown parameters 

vr(/3^,7, z, o■^ 7) = 7r^^_^(/3^,^|z, cr^ 7)7r^2 (cr2)7r^(z)7r^(7), 

where 

7rp^JP,,^\z,a\^) = iV(0,a2c(X;^X,,^)-i). (4) 

This conditional prior for (3^.-^ related to (7-priors (|Zellner 1986|) . has the ad- 
vantage of conjugacy when e is Gaussian, in which case the regression and 
variance parameters can be analytically integrated out. The case c = n corre- 
sponds to the unit information prior used by [DiMatteo et al. (2001) a default 



choice that has worked well in practice with large sample sizes. Smith and Kohn (1996) 
recommend values of c in the range 10 < c < 1000. For the variance parameter, 
we use the classical uninformative prior 11^2 {a'^) oc l/cr^ that leads to proper 
posteriors here (see for example iGelman et al. 20031 Chapter 2). Finally, we 
need to define the prior distribution for z. We consider here the decomposi- 
tion of this prior given by 

■Kz{z) = Tr{z I |z|)7r(|z|) 

where j^;] = J2k=i^ is the number of non-zero entries in z, i.e. the number of 
knots that are used in the corresponding model. We use as prior for \z\ a right- 
truncated Poisson distribution with parameter A, and maximum value L. The 
value of L < Kmax corresponds to the maximum number of knots allowed. 
We assume that, given the quantity L, all possible configurations for z have 
equal probabilities, so that 

■^z{z) OC -^l{\z\<L}, (5) 

where is 1 if ^ is true and otherwise. Under these prior and Gaussian 
error assumptions, the parameters /J^^^ and are easily integrated out of the 



posterior distribution. We finally get the joint posterior distribution for (z, 7) 
of the form 

7r(z, 7|y) oc (c + l)-(l^l+^+^)/25,,^(y)-"/V,(z)7r^(7) (6) 

where 

!^.JY) = Y'Y - 

c+1 



2.2 Inference on the posterior distribution 

An MCMC sampler is used for the inference on the model. Based on the pos- 
terior distribution it uses the following successive updates for z and 7: 

• Update z. This update involves two types of moves; with probability 
0.5 we propose an add/delete step, otherwise a swap step is proposed. 
Specifically, the two move steps involve 

- add /delete: randomly select a z^ and propose to change its value; 

- swap: randomly select two values Zi and zj, and propose to ex- 
change their values. 

In both cases, proposed moves from current value z to proposed value z' 
are accepted with the usual Metropolis-Hastings acceptance probability 



Update 7. For each k = 1, . . . ,Kmax, we differentiate the cases when 
Zk = and when Zk = 1: 

- if = then 7^. is updated according to its prior distribution, i.e. a 
uniform distribution on I^; 

- ii Zk = 1, 7fc is updated to a new value 7^, according to the posterior 
distribution 

An independence Metropolis-Hastings step can be used for this last type 
of updating, using the prior on 7^ as a proposal, with the corresponding 
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acceptance probability given by 

' T^{lk\lj^k,Z,Y) J ' 

Note that a rejection sampler may alternatively be used for this step, again 
using the prior on 7^ as the sampling distribution. This may be desirable in 
certain circumstances as the rejection sampler produces i.i.d. draws from the 
conditional posterior distribution of 7/^.. 

Once an MCMC sample {(2^*^ 7^*^)}j=i,...,7v is obtained, model inference 
proceeds following one of the two methods commonly used in such settings. 
The first method uses the maximum a posteriori (MAP) estimate for (z, 7) 

(z, 7) = argmax 7r(z'^*^ , 7*^*^ \Y), 

l<i<N 

and then calculate the corresponding least squares estimates (3z,j to give the 
curve estimate 

fix) = X.^^hn- (7) 

The second method uses a Bayesian model averaging approach (BMA) where 
the estimates for f{x) are averaged over different configurations of the aux- 
iliary variable z and their corresponding 7 values from the MCMC output. 
Since the conditional posterior expectation for (5 given z and 7 is, for large c, 

where j3z-j is the least squares estimate for (3 given z and 7, then an estimate 
for the curve can be obtained by 

1 ^ 

/"(^) = ]^E^^..7.4,7.- (8) 
1=1 



,1k) =mm 



2.3 Simulation studies 



We carry out simulation studies using the examples from Smith and Kohn (1996) 



DiMatteo et al. (2001) and Denison et al. (1998) We compare the performance 



of the methods of Smith and Kohn (1996) and Denison et al. (1998) with our 
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proposed method, and also discuss the selection of intervals Ik- In each exam- 
ple a cubic spline model is fitted by setting P = 3 in (|2]|. 



Example 1: In this example, taken from Smith and Kohn (1996)[ the true 
function takes the form 

f{x) = (?i(x, 0.15, 0.052)/4 + 0(2;, 0.6, 0.22)/4, x € [0,1] 

where (/){x, fi, a^) denotes the value at x of the normal density with mean 
and variance a^. Some 7i data points x are sampled from the uniform distri- 
bution [7(0, 1), and a zero-mean Gaussian noise e is added to the data, where 
e ~ A^(0, 0.25^). Sample sizes of n = 20 and n = 100 are studied. 



Example 2: In this example taken from Denison et al. (1998) the true func- 
tion is 

f{x) = sm{2x) + 2exp(-16x2), x G [-2,2]. 

This function is first rescaled so that the support is on the unit interval, and 
then evaluated at n points in [0, 1], generated from a f/(0, 1) distribution. A 
zero-mean Gaussian noise e is then added to the data, where e '-^ N{0, 0.3^). 
Sample sizes of 7i = 20 and n = 200 are studied. 



Example 3: This example is taken from DiMatteo et al. (2001) The true 
function is 

f{x) = sin(x) + 2exp(-30x^), x € [-2,2], 

evaluated at 7i regularly spaced grid points, and the variance of the noise is 
taken as a = 0.32. Again, we rescale to work on the unit interval for x. Sam- 
ple sizes of n = 20 and n = 101 are studied. 

To compare the different methods we use the mean squared error (MSE) as a 
measure of goodness of fit, given by 



1 

MSE = - j;{/(x,)-/(x,)}2 



n 

i=l 
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where / is the true function and / is the estimated function. For each example 
and for the three methods that are considered, MSEs for maximum a posteri- 
ori estimates and Bayesian model averaging estimates were calculated using 
Equations and (|8]|. Hereafter we refer to the data sizes of n = 100, 200, 101 
as large data sets, and n = 20 as small data sets. 

Concerning prior specifications, for each example the value c = n was 
used for the prior (H)) when computing for the larger data sets, and c = 200 for 



smaller data sets. As stated in Smith and Kohn (1996) the value of c should 
be between 10 to 1000, and in general we found values of around 100 to 500 
to give very stable results. For the truncated Poisson prior ^ we set A = 
3 and L = 10. We chose the Poisson parameter A to be 3 in the examples 
below, but results are largely insensitive to values of A around this range. The 
maximum number of knots allowed L is chosen to be large enough to not 
affect the simulation results here. 

For these examples we consider the situation where there is no prior infor- 
mation on the knot locations and chose the intervals 1^ to correspond to the 
ranges given by every sorted x values. We found that Ux = 4, 10 and 4 re- 
spectively were sufficient to provide a good fit in each of the three larger data 
set examples. For n = 20 we used = 2 in all three examples. In this case, the 



use of a B-spline basis to formulate the X-y matrix, as in DiMatteo et al. (2001) 



is required to avoid numerical instability (see e.g. Ruppert et al. 2003 1. In gen- 
eral, the choice of the size of the interval can depend on the data and there is a 
trade-off between computational time and accuracy, as sampler convergence 
is achieved more quickly for smaller number of intervals. 

Finally, for the MCMC computation of all three examples, starting with an 
arbitrary set of initial values generated from the prior distributions, we ran 
a burn-in of 500 iterations, followed by 1,000 recorded iterations, where each 
iteration involves an update of 20 z update steps for each 7 update step. Note 
that we found it to be more effective to increase the number of z updates, 
instead of increasing the total number of iterations, as 7 updates had very 
good mixing properties. To assess convergence, we monitored the trace plots 
of posterior values. We also ran much longer chains of 10,000 iterations and 
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found the results to be similar in terms of MSE calculations. This is perhaps 
not surprising since the posterior values suggested that the chains mixed very 
quickly. See Figure [T] for the fitted functions of the three examples using our 
method. 

Table [l] shows the MSEs for both the MAP and the BMA estimates using 



our method, the method of Smith and Kohn (1996) (using 1,000 iterations of 



MCMC updates and 500 burn in) and the method of |DiMatteo et al. (2001)] (us- 
ing 10,000 iterations and 1,000 burn in, as recommended in their paper). The 



method of DiMatteo et al. (2001) was tested using the BARS program available 
at |http : //wpicr . wpic . pitt . edu/WPICCompGen/bars . htm , For each 
example, estimates are calculated over 50 randomly generated data sets, re- 
spectively for both large and small data set sizes. See also Figure|2]for boxplots 
of these MSEs for the small data sets. 

In all examples, particularly for the smaller sample size of n = 20, the 



method presented in this paper clearly out performed the method of Smith and Kohn (1996) 



in both the MAP and BMA estimates. Our method is also very competitive 



with the method of DiMatteo et al. (2001) This is particularly noticeable in 
Example 1, where both our MAP and BMA estimates are marginally better, 
while in Examples 2 and 3 the MAPs generally performed better than BMA 



when compared to DiMatteo et al. (2001) The differences between the three 
methods for larger data sets are smaller, with our method out performing the 



method of Smith and Kohn (1996) by an order of between 10 ^ to 10 ^ in MSE 
estimates. 



Overall, our sampler clearly out performed the method of Smith and Kohn (1996) 



This gain in accuracy can be mainly attributed to the fact that our method al- 
lows a free knot selection procedure. Our sampler is also more efficient at find- 



ing the MAP estimate, resulting in smaller MSE estimates than DiMatteo et al. (2001) 



in general, while our corresponding BMA estimates are less accurate. This is 
perhaps unsurprising, since our algorithm contains more parameters, hence 
it would be difficult to visit every configuration the appropriate number of 
times. On the other hand, our algorithm is able to traverse the region of high 
density very quickly, hence obtaining an accurate MAP estimate in a relatively 
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short number of iterations. 

In terms of computation, there are two main differences between our method 



and that of Smith and Kohn (1996) Firstly, for the update of z, while Smith and Kohn (1996) 



cycle through each component of the z vector systematically, we randomly up- 
date a number of its components. For example, in the implementation of Ex- 
ample 1 with n = 100, we update the z vector 20 times compared to 25 times 
using Smith and Kohn (1996)[ clearly, the computational gain is greater in Ex- 
ample 2 with n = 200 when the length of the z vector in |Smith and Kohn (1996) 



is 50. Secondly, we have the additional update of the 7 parameters. However, 
this is a quick procedure, since it involves a simple sample from the prior dis- 
tribution for when there is no knot in the interval, and a Metropolis-Hastings 
update for when there is a knot, but the numbers in the latter are mostly small. 



A comparison with the method of DiMatteo et al. (2001) is more difficult, since 
they use a reversible jump scheme. So, in this regard, some users may find it 
simpler to work with our standard MCMC framework. We have also found 



that DiMatteo et al. (2001) required a much longer Markov chain to acheive 
convergence, particularly in terms of finding the MAP result, suggesting that 
there may be mixing issues. 



3 Extensions to non-Gaussian error models 

When the assumption of normality of ej in (|l]| is relaxed we can consider a 
model of the form 

li|xi,...,a;„ ~p(y|/(xi),cr*), i = l,...,n, (9) 

where fix) is still given by (|2]) and where a* denotes a potential nuisance 
parameter. The methodology used to fit the regression model can be employed 
in this more general setting, with the exception that we can no longer integrate 
out the (iz^-^ parameters analytically. 
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0.0 0.2 0.4 0.6 0.8 1.0 

X 

(c) Example 3 

Figure 1: Fitted cubic splines for n = 100, 200, 101 respectively for the three simu- 
lated examples. The three curves plotted are the true curve (solid line), estimated 
curves using MAP estimate (dash-dotted lines) and BMA estimate (dashed lines). 

3.1 Inference on non-Gaussian error models 

We need to add steps to update the values of the l3z,^ parameters in the MCMC 
sampler of Section |Z2l Here Zk = corresponds to rjk = 0, so we first propose 
to update the z and the Pz,^ parameters simultaneously and then propose k 
further updates of /3 to improve mixing. More precisely we use the following 
successive updates: 

• update z and Pz,'y '■ 

- Propose to update z to z' via either an add/ delete or a swap step as 
in Section IZ2i 

- Propose a new value ^ for the regression coefficients according to 
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Ex 1, MAP Ex 2, MAP Ex 3, MAP 




Ex 1, BMA Ex 2, BMA Ex 3, BMA 
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Figure 2: Simulation study. Boxplots corresponding to the MSEs presented in Table 
[T]for small sample sizes n = 20. 



an independence Metropolis-Hastings sampler. We use a multivari- 
ate Normal distribution N{(3z'^y, 6zT,z','y) as the proposal qz,z' , where 
(5z' is the MLE estimate of /3 and is the corresponding covari- 
ance matrix with respect to z' and 7. The move to (z', /3^, ^) is then 
accepted with probability 

<ii'z',r^'^l\^M^'Afi'z>a^i^^.i) \ 

otherwise, the chain remains at (z, iiz,-^)- 
update fiz.'-i '■ 

- If the update z and /3^,^ move above is accepted, then perform k > 
extra Metropolis-Hastings updates of /? using the multivariate Nor- 
mal distribution A^(/?2^^, (5/3Sz,^) as the proposal g/3,/3'. The moves 
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from P to (3' are accepted with probability 

. [, 7r(^^ 2;,7|y)g^/,/3(/?; ^^^,^) 
mm < 1 — — — — — 

I ' 7r(/3j;,^,2:,7 I y)g/3,/3'(/5^,7'/52,7) . 

• Update 7: 

- The corresponding 7 update step would remain the same as in Sec- 
tion 1221 

The values 6z and 6p can be tuned to optimise the mixing of the MCMC sam- 
pler in the usual way (see [Roberts and Rosenthal 200Hl . In update P, we per- 
form K > further steps of MCMC moves if the chain has moved to a new 
model to further facilitate mixing of the Markov chain, this step can be omit- 
ted for a longer overall MCMC chain. Note that for some applications, the 
computational cost of estimating the MLE of the likelihood may be similar to 
estimating the maximum a posteriori estimator of the posterior distribution. 
In this case we recommend the use of the latter to form the proposal distribu- 
tions since this give higher acceptance probabilities, see Example 14.21 

When the posterior differ greatly from the likelihood, making q^^z' in the 
update model move a poor proposal choice. In this situation, one may delay 
the rejection of the move from z to z' by making the k > additional update 
iP' — > /?*) moves first with respect to some distribution vr*, then carry out the 
accept/reject decision from {z, Pz,-t) to {z' , ^) with acceptance probability 



min < 1, 



Note that such a strategy is only beneficial when the moves /?' — > /3* are made 
with respect to a new distribution vr*, where the distribution vr* is chosen 
to facilitate moves towards the mode of the posterior distribution tt, conse- 
quently increasing the acceptance probability in the update model move. See 



Al-Awadhi et al. (2004) for further discussions on how to choose vr*. For the 



examples we studied, we did not find it necessary to make use of tt*, how- 



ever the reader is referred to Al-Awadhi et al. (2004) should mixing become 
an issue. 
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3.2 A simulated Poisson example 

In this section, we generate n = 500 Poisson random variables yi^i = 1, . . . , n 
from 

Hi ~ Poisson(exp {2xi + cos(47rxj)}) 

where rcj is uniformly sampled on the interval [0, 1]. We fit the curve Q for 
P = 1,2,3. We take the unit information prior c = n in (|4]) for the f3 parameter, 
setting c = 500, and for the truncated Poisson prior for z we take A = 1 and 
L = 10. We set the intervals Ik to be between consecutive numbers of the 
sequence 

(0.02, 0.1, 0.2, .0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.98). 

Note that the first and last interval are bounded away from the limits of the 
observed points, in order to avoid numerical problems which can sometimes 
occur with the specification of the prior @. 

Table |2] shows the average mean squared error calculations obtained from 
50 simulated datasets, together with the corresponding standard deviation. 
For each dataset, we ran our sampler for 5,000 iterations following 1,000 it- 
erations of burn-in. Here, for each update 7 step, 10 update z and (3z,'y steps 
are performed to obtain good mixing. For each update of z and I3z,'y we used 
K = 10 MCMC moves for Pz,^- Scaling parameters of the covariance matrices 
in the proposal distributions for the update of Pz,^ are Sz = Sj3 = 1/50. Note 
that the MAP and the BMA estimates here differ from Equations Q and ^ 
since the values of j3z,'y are not MLE plug-ins. Figured shows the fitted curves 
using the two estimators. The BMA estimates give a smoother curve estimate, 
particularly for P = I. 

Two alternative methods of updating the (3z,'y parameters have been used. 



One is to use the MLE plug-in estimates for the Pz,'yS as in Denison et al. (1998) 



where I3z,'y is not treated as a parameter in their Bayesian model. In imple- 
menting this method for this example, we found it is only slightly quicker 
than our MCMC update of Pz,'y, since for each update of Pz,^ the expense of 
estimating the MLEs is the same for both algorithms. Our method then in- 
cludes an additional k = 10 computationally inexpensive steps of Metropolis- 
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Hastings updates using the existing MLE estimates. In an alternative method, 
[DiMatteo et al. (2001) propose to use an importance sampler to calculate the 
expected values of the l^z-fS at each iteration. We implemented this method, 
using an importance sampling distribution based on the MLE estimates and 
the corresponding covariance matrix, to obtain 1,000 samples, and found this 
to be considerably slower than our method. The MSE estimates using both 
plug-in MLE and importance sampling were approximately the same as found 
in Table H 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 



X X 

(a) MAP (b) BMA 

Figure 3: Simulated Poisson example; fitted curves using MAP (left) and BMA 
(right) estimates for P = 1, 2, 3. 



4 Applications in change-point modelling 

Many change-point type problems can be converted to the curve fitting frame- 
work. In the following, we first show by example an explicit equivalence be- 
tween a change-point model and a linear regression spline, where one is inter- 
ested in retaining interpretation of the coefficients. We then give an example 
of change-point detection in the context of accurate seasonal modelling for an 
extreme rainfall problem. 
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4,1 Bayesian modelling of prehistoric tombs 

Consider the modelling of prehistoric corbelled domes (late Minoan Tholos 



data collected from Dimini in Crete; Cavanagh and Laxton 1982 1. Paired data 



arise in the form (dj, rj), where di represents the distance below the apex of the 
tomb with the corresponding radius measured at di. These data are thought 
to approximately follow a log-linear model between a series of change-points. 
The model is formulated as 

log(rj) = log(aj) + bj log(di + A^) + e^, if 7j_i < di < 

where Aj is the distance between the apex of the tomb and the begining of 
the measurement of depth dj. The change-points < 71 < . . . < 7/^ (with 
K unknown) and the parameters of the model are subject to the continuity 
constraints 

ajilj + ^j)^' = «i+i(7j + 

for j = 1, . . . , K — 1 and Si N{0, o"^). We are interested in making poste- 
rior inference on the number and location of the change-points, as well as the 
coefficients aj and bj, while retaining their parametric interpretations. 

Here, we restrict our interest only to the detection of the number and lo- 
cation of the change points. A more sophisticated model was considered 
from a Bayesian perspective by |Fan and Brooks (2000) where computation 



was carried out using the reversible jump MCMC algorithm of Green (1995) 
using split/ merge and birth/ death proposals for the transdimensional moves. 
The above representation of a change-point model can be equivalently re- 
expressed in our framework of Equation (|2]|, where the function /(x) is given 
by 

K 

f{x) = ao + aix + ^ + x/7fc)+, 



k=l 



where Xi = log{di + Aj), log(ai) = ao and log(aj) = ao - T.i=i "nkj > 1 and 
where 61 = ai and bj = qi + Yli~iVk/7k,j > 1- The corresponding design 
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matrix is given by 
/l 



1 



Xl ... (-l + xi/7i)_ 

X2 ■■■ (-I + X2/71)- 



{-l + XihK)+ ^ 
{-l + X2hK) + 

{-1 + XnhK)+ ) 



(10) 



\ 1 Xn ... (-1 + Xnhl)+ ■ 

This alternative design matrix allows us to retain interpretation on the regres- 
sion coefficients. For simplicity, we set the value of Aj = 0.54, j = 1, . . . , ET — 1, 
the value of posterior mean for these parameters found in the model with the 



highest posterior probability in Fan and Brooks (2000) Note that we could in- 
corporate the updating of the Aj parameters into our current algorithm. 

Since, in this example, the data consist of only n = 15 data points, we 
set the value of c = 500 in the prior specification of Equation (|4]| to reflect 
a vague prior. We take a trimcated Poisson prior for the number of change 
points with A = 1 and truncated at a maximum of L = 3 change points. Visual 
inspection of the data suggest that it would be sensible to place the interval for 
the occurrence of change points to be between the values 0.15, 0.8, 1.2, 1.6. We 
ran the MCMC sampler of Section l2!2l with 1,000 iterations of burn-in and 5,000 
iterations of post burn-in samples. To increase mixing, for each update of the 
7 parameter, we updated the auxiliary variable z 20 times. Trace plots of the 
posterior values and number of knots over the iterations are shown in Figure 
m Convergence appears to have been achieved after around 1,000 iterations in 
this example. 

Figure |5] shows the fitted curve using posterior modal estimates with a sin- 
gle change point found to be around 71 = 1.29. |Fan and Brooks (2000) found 
that the model with the highest posterior model probability contained one 
change point, with mean 1.32 on the log scale. Similarly, our MLE estimates 
for the remaining parameters are log (ai) = — 0.15(— 0.14), log(a2) = 0.37(0.41), 
hi = 0.96(0.95) and 62 = 0.56(0.54), with the posterior mean estimate in paren- 



theses quoted from Fan and Brooks (2000) for comparison. Finally, in terms of 



computation, the sampler used by [Fan and Brooks (2000) [ required 15,000,000 
MCMC iterations, as the continuity constraint posed a problem for mixing. 



See also Sisson and Fan (2007) for related discussion on mixing. 



18 







1000 



2000 



3000 



4000 



5000 







1000 



2000 



3000 



4000 



5000 



iterations 



iterations 



Figure 4: Traceplots of the posterior values and the number of change points for 
the Bayesian model of prehistoric tombs example. 

4.2 Modelling extreme rainfall 

We now consider the modelling of extreme levels of a sequence {Xt},t = 
1, ... of daily rainfall measurements. Following standard arguments from 
extreme value theory (e.g. IColes 2001]) for a large enough threshold, u, the 
distribution of threshold exceedances, Yt = Xt — u, conditional upon Xt > u, 
approximately follows a generalised Pareto distribution 



defined on {yt : yt > and (1 + Ctyt/cft) > 0}. Time-dependent parameters at 
and £,t respectively determine scale and shape (through the rate of tail decay). 

We consider extreme daily rainfall levels recorded at Maiquetia Interna- 
tional Airport, Venezuela, for the period 1961-1999. Particular interest in this 
series arises through an event in December 1999 which was almost three times 




(11) 
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MAP estimates 




log(d) 

Figure 5: Fitted curves using MAP for data from Dimini (with 95% pointwise pre- 
diction interval). 

greater than any previously recorded rainfall (Figured open circle). In a pre- 
vious analysis, |Sisson et al. (2006) modelled within-year seasonal variations 



using constant scale and shape parameters between seasonal change-points, 
where the number and location of change-points was unknown. Accurate 
modelling of seasonal variability was demonstrated to be crucial in terms of 
making realistic predictions concerning the December 1999 event. The analy- 



sis of Sisson et al. (2006) implemented reversible jump MCMC with split / merge 
steps for between-model transitions. Between-model chain mixing was gener- 
ally poor, necessitating long chain runs to ensure accurate posterior inference. 

Here we model within-year variations by expressing both at and as first- 
order curves f{x) (Equation ^ with P = I) ,where t = 1, . . . , 366 now specifi- 
cally denotes the day of the year. The (unknown) location and number of knot 
points correspond to variations in the underlying seasonal climate. As any 
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Rainfall from Malquetia, Venezuela (1961-1999) 
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Figure 6: Daily rainfall measurements exceeding u = 10 mm recorded at Maiquetia Inter- 
national Airport, Venezuela during 1961-1999. Open circle indicates December 1999 event 
(not used to fit model). Curves denote (pointwise) posterior means and 95% credibility 
regions for at and (,t, and 50-year return levels (curves linearly scaled for visualisation 
purposes). Crosses denote MAP change-point estimates for two-season model under anal- 
ysis of Sisson et al. (2006). 

temporal fluctuations in the distribution of rainfall extremes can reasonably 
be expected to affect both location and scale parameters simultaneously, we 
express both at and ^t as functions of the same 7 and z variables, but allow 
different /3z,"f coefficients. Given that the last day in the year is temporally 
adjacent to the first day of the following year, the model requires curve con- 
tinuity at the yearly end points. This is achieved by imposing the constraints 
Co = C366 and the first derivatives ^ \t=o = ^ \t=366 (and similarly for ^t)- 
Specifically, this amounts to 



K 



K-l 




and 




k=l 



k=l 



for both at and $,t (the indexing of Pz,-/ coefficients on at and S,t is suppressed 
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for clarity). 

The non-Gaussianity of the model means we are unable to analytically in- 
tegrate out the I3z,-f coefficients, and so we implement the algorithm in Section 
13.11 In all, 5,000 MCMC iterations were obtained following 1,000 iterations 
burnin, for each iteration we perform 10 updates of 7, and use k = 10 each 
update of z. Here, as maximum likelihood estimates of I3z,y under the gener- 
alised Pareto distribution require numerical optimisation of the likelihood, we 
modified the MLE estimate to be the maximum a posteriori estimates for im- 
proved sampler efficiency for the same computational effort. The covariance 
matrix scaling factor for the f^z^-y proposal updates was set to 5^ = l,Sf^ = 1/10. 
Prior specification was A = 1, c = n, L = 10, and 10 equally spaced intervals 
over the range 1 to 366 were used. 

Figure [6] displays the rainfall measurements plotted against the day of the 
year, with pointwise posterior means and 95% credibility intervals for shape 
and scale parameters (scaled linearly for visualisation purposes). Also shown 
is the pointwise posterior predictive mean 50-year return level, defined as the 
rainfall level that is exceeded on average once every 50 years. Following from 
(|TT|| this may be obtained as the value Z50 that is the solution of 



where Uy = 365.25 is the average number of observations per year and = 
Pr{Xt > u) is the probability that an individual observation exceeds the thresh- 
old, u. 

The low return level around the middle of the year (in the "wet" season) 
corresponds to relatively low shape and scale parameters for this period, while 
conversely the high return level (in the "dry" season) corresponds to relatively 
high (Jt and £,f The timing of changes in the tail behaviour of the fitted Pareto 
density (as evidenced by variations in the 50-year return level) corresponds 
well with previously identified MAP changepoints (indicated by x 's in Figure 
|6]l for a two-seasonal model dSisson et al. 20061) . The computation required for 
this inference was considerably less than for the earlier analysis. 
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5 Discussion 



This article focuses on the auxiliary variable approach to the fitting of curves. 
This approach allows us to compute for the unknown number and location 
of the knots, via a Metropolis-within-Gibbs sampler. We have adopted the 
use of a spline regression model of the form ((2]). However, more sophisticated 
expressions can be found for curves (see for example IDenison et al. 1998|l , to 
which the methods described here easily extend. 

Our method depends, to some extent, on the specification of the intervals 
Ik,k = 1, . . . Kmax in which knots 7 may be found. The advantage of our 
approach over [Smith and Kohn (1996) is that it gives a more accurate infer- 
ence, particularly for small data sets. For instance in Example 14. 1[ the MAP 
change-point is found to be between two data points while the method of 
[Smith and Kohn (1996) does not allow for this location. In all the examples 
presented in this paper we have only used non-overlapping intervals. How- 
ever, it is possible to allow overlapping intervals using, for example, an or- 
dering constraint on the values of 7. We have also shown via simulated data 



sets that our method compares well with the method of DiMatteo et al. (2001) 
which uses the reversible jump approach. 

We have also provided a new Metropolis-within-Gibbs sampler algorithm 
to fit the regression model when the Gaussian error assumption is relaxed. In 
this case our sampler needs to include an additional step for the computation 
of the spline coefficients. In particular, we advocate the use of MLEs in the 
construction of a proposal distribution for the coefficients when moving to a 
new model. 

Finally, we revisited two real examples of Bayesian change point detec- 
tion, and showed that these types of problems may be converted to the vari- 
able selection setting, hence making use of the auxiliary variable approach. 
Many Bayesian change point analyses with unknown number and location of 
change points are computed with the use of complex implementations of the 
reversible jump algorithm (for example involving, split/ merge and birth/ death 
moves), as were the cases for the original analyses in Section HI Although the 
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reversible jump samplers can handle more complex, non-standard problems, 
we have found that our approach here is far simpler to implement. We were 
able to use standard statistical software R ( [Venables and Ripley 2005] l to im- 
plement both examples very efficiently. 

6 Supplemental materials 

The following supplemental materials are made available online. 

Data and Computer Code R programs to run the algorithms described in this 
article. All simulated data sets and real data sets used in the examples 
are also included. Please refer to the README files in the relevant direc- 
tories for instructions, (curves.tar.zip, tarred zip file) 
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FDS 


SK 


DGK 


Example 1 


n = 


20 


MAP 


0.0355 


0.0531 


0.0387 










(0.0169) 


(0.0351) 


(0.0203) 








BMA 


0.0317 


0.0433 


0.0335 










(0.0135) 


(0.0199) 


(0.0164) 




n = 


100 


MAP 


0.0072 


0.0078 


0.0078 










(0.0036) 


(0.0038) 


(0.0041) 








BMA 


0.0066 


0.0073 


0.0060 










(0.0032) 


(0.0034) 


(0.0030) 


Example 2 


n = 


20 


MAP 


0.0631 


0.0961 


0.0664 










(0.0288) 


(0.0379) 


(0.0273) 








BMA 


0.0638 


0.0837 


0.0534 










(0.0289) 


(0.0308) 


(0.0233) 




n = 


200 


MAP 


0.0070 


0.0088 


0.0068 










(0.0029) 


(0.0057) 


(0.0027) 








BMA 


0.0061 


0.0076 


0.0057 










(0.0022) 


(0.0029) 


(0.0022) 


Example 3 


n = 


20 


MAP 


0.0936 


0.1262 


0.0916 










(0.0468) 


(0.0308) 


(0.0439) 








BMA 


0.1012 


0.1093 


0.0772 










(0.0369) 


(0.0244) 


(0.0359) 




n = 


101 


MAP 


0.0123 


0.0134 


0.0116 










(0.0068) 


(0.0069) 


(0.0055) 








BMA 


0.0116 


0.0133 


0.0099 










(0.0061) 


(0.0060) 


(0.0056) 



Table 1: Simulation study. Mean MSEs with estimated standard errors in brack- 
ets based on 50 samples obtained using the maximum a posteriori (MAP) and 
Bayesian model averaging (BMA) estimates for each of the three methods: FDS 
(method presented in this paper), SK (method of Smith and Kohn 1996) and DGK 
(method of DiMatteo et al. 2001). 
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P = 1 


P = 2 


P = 3 


MAP 
BMA 


0.3659 (0.0959) 
0.2712 (0.0977) 


0.1647 (0.0725) 
0.1626 (0.0918) 


0.1176 (0.0648) 
0.1117(0.0671) 



Table 2: Simulation study for the Poisson example. Average MSEs with estimated 
standard errors in brackets based on 50 samples, obtained using maximum a pos- 
teriori (MAP) and Bayesian model averaging (BMA) estimates. 
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